feat(celery): Prepare to run on multiple queues #19157

Merged
webjunkie merged 11 commits into master from feature/celery-queue-configure on Jan 17, 2024

Conversation

webjunkie (Contributor)

Problem

We want to run separate Celery workers that consume a different set of queues each.
See RFC https://github.com/PostHog/product-internal/pull/537

Changes

  • make local and Docker Celery consume from multiple queues
    • we probably don't want to run multiple consumers; a single combined worker that listens on all queues is enough
  • prepare to have a gevent consumer (see the sketch after this list)
    • better suited for I/O-bound work
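
A rough sketch of what that combined worker could look like if started from Python (the app import path, queue list, and concurrency value are illustrative assumptions, not taken from this PR):

# Hedged sketch: one combined Celery worker consuming several queues on a
# gevent pool. Assumes Celery 5 with gevent installed; the import path and
# concurrency are illustrative only.
from posthog.celery import app

app.worker_main([
    "worker",
    "-Q", "celery,email,insight_export,insight_refresh,gevent",  # one worker, many queues
    "-P", "gevent",  # pool implementation suited to I/O-bound tasks
    "-c", "50",      # with gevent this is the number of greenlets, not processes
])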

How did you test this code?

  • tested locally
  • ran the gevent consumer and tried async queries (what it will be used for)

Reasoning:
We need to configure Celery workers in several places to consume
from a specific set of queues.

github-actions bot commented Dec 7, 2023

Size Change: +129 B (0%)

Total Size: 2 MB

Filename                   Size    Change
frontend/dist/toolbar.js   2 MB    +129 B (0%)

compressed-size-action

@@ -0,0 +1,3 @@
# Default set of queues to be used by Celery.
# Important: Add new queues to make Celery consume tasks from them.
CELERY_WORKER_QUEUES=celery,email,insight_export,insight_refresh,gevent
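
For illustration only, a worker bootstrap could consume the value defined above roughly like this (a sketch; the fallback to the default "celery" queue is an assumption):

import os

# Hedged sketch: read the queue list from the env file, falling back to
# Celery's default queue name if the variable is unset.
queue_csv = os.environ.get("CELERY_WORKER_QUEUES", "celery")
queues = [q.strip() for q in queue_csv.split(",") if q.strip()]
# The comma-separated value is what would be handed to the worker via -Q.
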
Contributor

If it is somehow possible to define this in code, I would 100% vote for that, as otherwise I think this file will get forgotten about.

Best case would be to give ourselves a chance to never miss it, for example by using an enum of sorts so that we define the queues in one place. Then, if the env var isn't set, we default it to all the known queues.

Contributor Author

Mhm, I'm not super sure what you mean. This was my attempt to define it in one place. The problem I see is that we start Celery workers, as far as I can tell, in two separate places for local and hobby deployments. Both of these are shell scripts, hence both can use this env file. To be honest, I don't find it particularly nice either. Do you see another way to do it? 🤔

Contributor Author

@benjackwhite I'm still not sure how to solve this in a better way right now. What I did improve, though, is that we can also use the env file for the PyCharm run config, so this single list of all queues is used in all places.

Contributor

I should have been clearer with my description.

What I'm wondering is whether there is any way we can do this in Python?

So for example:

import os
from enum import Enum

class CeleryQueue(str, Enum):
    DEFAULT = "celery"
    EMAIL = "email"

# All known queues, derived from the enum.
all_queues = [e.value for e in CeleryQueue]

# Use the env var if set (comma-separated), otherwise default to all known queues.
env_value = os.getenv("CELERY_WORKER_QUEUES")
CELERY_WORKER_QUEUES = env_value.split(",") if env_value else all_queues

My thinking is that we would then codify the default list of queues rather than relying on an external setting (which can easily be forgotten about), and a new queue would be picked up automatically when added.

Contributor

If it just isn't possible for one reason or another, though, then consider it non-blocking and we can push forward.

Contributor

Alternatively: if Celery would just default to processing from all queues anyway, why do we need this config at all? 😅

Contributor Author

How I see this problem: Celery is run via shell scripts and we need to pass in the queues there (it apparently cannot consume from "all" queues – at least not that I could find). I agree that having this in Python, where we also route the tasks to the queues, would be great though.
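
A rough sketch of that direction, assuming the queue names lived in Python next to the routing config (the task name pattern and app name are hypothetical, not taken from this PR):

# Hedged sketch: if queues were defined once in Python (e.g. the enum suggested
# above), the same names could drive task routing. Patterns are hypothetical.
from enum import Enum

from celery import Celery


class CeleryQueue(str, Enum):
    DEFAULT = "celery"
    EMAIL = "email"


app = Celery("posthog")
app.conf.task_default_queue = CeleryQueue.DEFAULT.value
app.conf.task_routes = {
    "posthog.tasks.email.*": {"queue": CeleryQueue.EMAIL.value},  # hypothetical pattern
}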

Contributor

Gotcha. Yeah, if it's just not possible then I guess this is the solution 😅

@posthog-bot

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@posthog-bot

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

webjunkie marked this pull request as ready for review on January 8, 2024 at 11:08
# Conflicts:
#	requirements-dev.txt
#	requirements.in
#	requirements.txt
webjunkie force-pushed the feature/celery-queue-configure branch from 102c5d4 to 40b985c on January 16, 2024 at 13:22
webjunkie merged commit 95fec19 into master on Jan 17, 2024
93 checks passed
webjunkie deleted the feature/celery-queue-configure branch on January 17, 2024 at 11:54

3 participants